ggml: Implement yield barrier using futex for improved thread scheduling efficiency #13079
Conversation
Hi, I would like to ask for your opinion regarding the use of futex-based yield barriers versus traditional spin barriers. I would appreciate your thoughts on whether a yield-based approach is a good fit for GGML's threading model on mobile devices or on servers under heavy load. Thank you for your consideration!
Hi, sorry for taking so long to respond to this. I think this is very interesting and definitely something that we should be working towards, but as it is, the performance hit during generation is in my opinion too high for this to be useful in its current state. Ideally, this should be something that is always enabled and is triggered automatically after spinning for a while. I understand that the code already does this, so I wonder if it is a matter of tuning. If I am not mistaken, the gcc openmp implementation does something like this as well.

Hiding this behind a compile flag that is disabled by default is likely to result in this being dead code that very few people are going to use.

On a less important note, I was not able to replicate the results on an M3 Max. In my tests, this was always slower.
You're right, GCC's OpenMP implementation does something similar. I've also implemented a check for the number of affinity cores to avoid unnecessary spinning; this is particularly helpful for processes restricted to a subset of cores.

Regarding your M3 Max (12P+4E, I guess) results:

Completely agree. The goal is absolutely to make this always enabled and triggered automatically. Below is a set of benchmark results from my M3 Pro (5P+6E). It shows that pp512 and pp256 consistently benefit from yield_barrier, while tg128 and tg64 performance drops, especially at higher thread counts, supporting the idea that automatic tuning (e.g., adjusting threads per phase) might be a better long-term solution. According to your results, even with the spin policy, the best performance is achieved with 8 threads, not more.
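As a side note on the affinity-core check mentioned above, here is a minimal sketch of what such a check could look like on Linux. The function name and the sysconf fallback are my own illustrative assumptions, not the PR's actual code:

```c
// Hedged sketch: count the CPUs this process is actually allowed to run on
// (its affinity mask) rather than all installed cores, so spinning can be
// avoided when threads outnumber the usable cores. Linux-specific; the
// function name and fallback are illustrative assumptions.
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

static int usable_cpu_count(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == 0) {
        return CPU_COUNT(&set);                      // cores in the affinity mask
    }
    return (int) sysconf(_SC_NPROCESSORS_ONLN);      // fallback: online CPUs
}
```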
Wouldn't this be the same as what is already implemented for mul_mat and mul_mat_id?
llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c, line 1456 (at 3b127c7)

However, this is not supported when repacking Q4_0, since it uses a different implementation of the matrix multiplication functions.
llama.cpp/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp, line 6058 (at 3b127c7)
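For readers following along, the chunk-level scheduling being referred to is roughly the following pattern: each thread atomically claims the next chunk index until none remain, so faster cores naturally pick up more of the work. This is an illustrative sketch with made-up names, not the actual ggml_compute_forward_mul_mat code:

```c
// Illustrative sketch of atomic chunk claiming (names are assumptions, not
// the actual ggml code referenced above).
#include <stdatomic.h>

void process_chunks(atomic_int *current_chunk, int n_chunks,
                    void (*process_chunk)(int chunk, void *ctx), void *ctx) {
    for (;;) {
        int chunk = atomic_fetch_add(current_chunk, 1);   // claim the next chunk
        if (chunk >= n_chunks) {
            break;                                         // all chunks claimed
        }
        process_chunk(chunk, ctx);                         // do the work for it
    }
}
```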
Ah, I see, I missed this part, you're right. That does implement a similar chunk-level scheduling mechanism. So the performance regressions I'm seeing are probably due to the spin count not being tuned well; a smarter, adaptive spin-wait threshold could reduce the cost of falling back to futex() syscalls. I'll do more testing.
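For illustration only, one possible shape of such an adaptive spin-wait threshold (a sketch under my own assumptions; the constants and names are not from the PR): spin longer next time if the previous wait finished without sleeping, and back off if a futex sleep was needed.

```c
// Hedged sketch of an adaptive spin limit: grow it while spinning alone is
// enough, shrink it when a futex sleep was needed. All values are assumptions.
static _Thread_local int spin_limit = 1 << 10;   // per-thread starting guess

static void adapt_spin_limit(int had_to_sleep) {
    if (had_to_sleep) {
        // Spinning did not pay off last time: spin less before sleeping.
        spin_limit = spin_limit > 64 ? spin_limit / 2 : 64;
    } else {
        // The barrier completed while spinning: allow a bit more spinning.
        spin_limit = spin_limit < (1 << 16) ? spin_limit * 2 : (1 << 16);
    }
}
```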
Description:
This PR replaces the original spin-based barrier in GGML with a futex-based yield barrier to improve thread scheduling efficiency and overall system performance.
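To make the idea concrete, below is a minimal sketch of a spin-then-sleep barrier of the kind described here, assuming Linux futexes. The structure, field names, and the SPIN_LIMIT value are illustrative assumptions rather than the PR's actual implementation: threads spin briefly so that short waits stay cheap, and only fall back to a futex sleep if the remaining threads take longer to arrive.

```c
// Illustrative spin-then-futex barrier (Linux-only sketch, not the PR's code).
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/futex.h>
#include <sys/syscall.h>

struct yield_barrier {
    atomic_int  n_arrived;  // threads that have reached the barrier this phase
    atomic_uint phase;      // generation counter, also used as the futex word
    int         n_threads;  // total number of participating threads
};

static void futex_wait(atomic_uint *addr, unsigned expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void futex_wake_all(atomic_uint *addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT32_MAX, NULL, NULL, 0);
}

void yield_barrier_wait(struct yield_barrier *b) {
    const int SPIN_LIMIT = 1 << 14;            // assumed tuning knob
    unsigned phase = atomic_load(&b->phase);

    if (atomic_fetch_add(&b->n_arrived, 1) + 1 == b->n_threads) {
        // Last thread to arrive: reset, advance the phase, wake any sleepers.
        atomic_store(&b->n_arrived, 0);
        atomic_fetch_add(&b->phase, 1u);
        futex_wake_all(&b->phase);
        return;
    }

    // Spin first: cheap when the remaining threads arrive quickly.
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (atomic_load(&b->phase) != phase) {
            return;
        }
    }

    // Yield the CPU: sleep until the last thread advances the phase.
    while (atomic_load(&b->phase) == phase) {
        futex_wait(&b->phase, phase);
    }
}
```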
Currently, the feature can be controlled using the CMake parameter GGML_YIELD_BARRIER, allowing users to enable or disable the yield barrier as needed.
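A sketch of how such a compile-time switch is typically wired, assuming (my assumption, not confirmed by the PR) that the CMake option is forwarded to the compiler as a preprocessor define of the same name, e.g. via cmake -B build -DGGML_YIELD_BARRIER=ON:

```c
// Illustrative compile-time gating; the macro and function names are
// placeholders, not the actual identifiers used by the PR.
#ifdef GGML_YIELD_BARRIER
    #define GGML_BARRIER(b) yield_barrier_wait(b)   // futex-based yield barrier
#else
    #define GGML_BARRIER(b) spin_barrier_wait(b)    // original spin barrier
#endif
```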
Key Benefits:

Improved Scalability
The futex-based barrier allows threads to yield instead of busy-waiting. This reduces CPU waste and improves scalability when the number of threads exceeds the number of physical cores, or when other workloads are competing for CPU time.
Better Performance on Hybrid Architectures
On systems with heterogeneous cores (e.g., big.LITTLE or Intel Hybrid Architecture), yielding helps critical threads get scheduled on performance cores, potentially improving throughput (e.g., PP performance in multi-threaded inference).
Power Efficiency and Thermal Stability
By avoiding unnecessary spinning, this change can reduce power consumption and help maintain higher sustained performance, especially on thermally constrained devices. It may also mitigate CPU throttling under load.
Benchmark:
Based on build 42eb248 (5025).
Apple M1 (4P+4E), Accelerate framework and Metal disabled
before:
after:
Apple M3 Pro (5P+6E), Accelerate framework and Metal disabled
before:
after:
Apple M4 (running a binary compiled natively on M1)
before:
after:
before:
after:
Snapdragon 888 (X1 + A78x3 + A55x4)
before:
after:
before:
after:
Snapdragon 6Gen1 (A78x4 + A55x4)
before:
after:
before:
after:
Ryzen 9950X (light thermal throttling observed)
before:
after:
before:
after:
Ryzen 9950X (spin-based bottleneck: threads > cores)
before:
after:
Conclusion:
Across most tested devices, the pp512 workload consistently benefits from the futex-based yield barrier, showing noticeable throughput improvements. This is especially evident on high-core-count or hybrid-core systems, where reduced spinning improves scheduling fairness and efficiency.

However, tg128, which is typically less compute-intensive and more sensitive to load imbalance, may degrade slightly in some cases. This is likely due to the lower thread saturation and the increased context-switching overhead introduced by yielding, which affects lighter workloads more noticeably.